========================================================
This report/analysis is done on the ProsperLoan dataset, which contains records for 113,937 individual loans with 81 variables.
From wikepedia: Prosper Marketplace, Inc. is a San Francisco, California-based company in the peer-to-peer lending industry. The company operates Prosper.com, a website where individuals can either invest in personal loans or request to borrow money.
From research done on ProsperLoan it is noticed that the business model changed signifcantly post July 2009 when company was registered at the SEC. From very top level exploration this is evident in a number of variables which include N/A for periods before July 2009. Therefore I have made an upfront decision to limit dataset to records after July 2009 which reduced number of records to 83,853 and I have furthermore choosen 13 variables from the 81 in the dataset, on which EDA will be conducted.
The following is a list of variables choosen and summary statistics on same.
## 'data.frame': 84853 obs. of 12 variables:
## $ term : int 36 36 36 60 36 36 36 36 60 36 ...
## $ loanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 4 4 4 4 4 4 4 4 4 8 ...
## $ borrowerRate : num 0.092 0.0974 0.2085 0.1314 0.2712 ...
## $ estimatedLoss : num 0.0249 0.0249 0.0925 0.0449 0.1275 ...
## $ prosperRating : Ord.factor w/ 7 levels "HR"<"E"<"D"<"C"<..: 6 6 3 5 2 4 7 7 4 5 ...
## $ prosperScore : num 7 9 4 10 2 4 9 11 7 4 ...
## $ listingCategory : int 2 16 2 1 1 2 7 7 1 1 ...
## $ employmentStatus : Factor w/ 9 levels "","Employed",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ isBorrowerHomeowner: Factor w/ 2 levels "False","True": 1 2 2 2 1 1 2 2 1 1 ...
## $ incomeRange : Ord.factor w/ 8 levels "Not displayed"<..: 6 5 8 8 5 5 5 5 7 5 ...
## $ loanOriginalAmount : int 10000 10000 15000 15000 3000 10000 10000 10000 13500 4000 ...
## $ creditScoreRangeAvg: num 690 810 690 750 690 ...
## term loanStatus borrowerRate
## Min. :12.00 Current :56576 Min. :0.0400
## 1st Qu.:36.00 Completed :19664 1st Qu.:0.1359
## Median :36.00 Chargedoff : 5336 Median :0.1875
## Mean :42.49 Defaulted : 1005 Mean :0.1960
## 3rd Qu.:60.00 Past Due (1-15 days) : 806 3rd Qu.:0.2574
## Max. :60.00 Past Due (31-60 days): 363 Max. :0.3600
## (Other) : 1103
## estimatedLoss prosperRating prosperScore listingCategory
## Min. :0.00490 HR: 6935 Min. : 1.00 Min. : 0.000
## 1st Qu.:0.04240 E : 9795 1st Qu.: 4.00 1st Qu.: 1.000
## Median :0.07240 D :14274 Median : 6.00 Median : 1.000
## Mean :0.08031 C :18345 Mean : 5.95 Mean : 3.313
## 3rd Qu.:0.11200 B :15581 3rd Qu.: 8.00 3rd Qu.: 3.000
## Max. :0.36600 A :14551 Max. :11.00 Max. :20.000
## AA: 5372
## employmentStatus isBorrowerHomeowner incomeRange
## Employed :67310 False:40005 $50,000-74,999:25627
## Full-time : 7927 True :44848 $25,000-49,999:24175
## Self-employed: 4538 $100,000+ :15205
## Other : 3806 $75,000-99,999:14498
## Not employed : 649 $1-24,999 : 4654
## Retired : 367 Not employed : 649
## (Other) : 256 (Other) : 45
## loanOriginalAmount creditScoreRangeAvg
## Min. : 1000 Min. :609.5
## 1st Qu.: 4000 1st Qu.:669.5
## Median : 7500 Median :709.5
## Mean : 9083 Mean :708.9
## 3rd Qu.:13500 3rd Qu.:729.5
## Max. :35000 Max. :889.5
##
Note: Variable “CreditScoreRangeAvg” is a calculated average from the 2 variables “CreditScoreRangeLower”/“CreditScoreRangeUpper” in original dataset. This reducing the variables in subset to 12 variables for EDA exercise.
BorrowerRate is showing an approximate normal distribution, but adjusting the bin size and axis scaling, can help show more presice information.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1359 0.1875 0.1960 0.2574 0.3600
Most BorrowerRate’s are situated around 0.15 +/- 0.5 but due to some peaks between 0.25 and 0.35 the mean is approx. 0.2.
Due to some outliers at > 0.2 the distribution of EstimateLoss is left skewed, when removing these outliers we can see most EstimatedLoss’s are in the 0.025 to 0.075 range.
##
## 12 36 60
## 1613 58695 24545
##
## 12 36 60
## 0.01900935 0.69172569 0.28926496
Term is very straigtforward with only 3 options and need only be viewed in table format. 36(months) is by far the most popular option being 69% of total.
LoanStatus includes a number of options with very few records, in order to better understand distribution the “PastDue” options are cut into one total bucket.
##
## Chargedoff Completed Current
## 5336 19664 56576
## Defaulted FinalPaymentInProgress PastDue
## 1005 205 2067
Adding “PastDue” to one bucket does not change picture much, most records are either current or completed, with a noticeable “ChargedOff” at > 5,000 records.
At facevalue this variable does not contain much information being numerical only. In order to explore it has been mapped to the decsription found in the ProsperLoanData variable dictionary.
##
## Not Available Debt Consolidation Home Improvement
## 0.0002357017 0.6267309347 0.0801503777
## Business Personal Loan Student Use
## 0.0624373917 0.0032291139 0.0263632400
## Auto Other Baby&Adoption
## 0.1086349334 0.0023452323 0.0010017324
## Boat Cosmetic Procedure Engagement Ring
## 0.0010724429 0.0025573639 0.0006953201
## Green Loans Household Expenses LargePurchases
## 0.0235230340 0.0103237363 0.0179369026
## Medical/Dental Motorcycle RV
## 0.0035826665 0.0006128245 0.0104298021
## Taxes Vacation Wedding Loans
## 0.0090509469 0.0090863022 0.0000000000
## Not Available Debt Consolidation Home Improvement
## 20 53180 6801
## Business Personal Loan Student Use
## 5298 274 2237
## Auto Other Baby&Adoption
## 9218 199 85
## Boat Cosmetic Procedure Engagement Ring
## 91 217 59
## Green Loans Household Expenses LargePurchases
## 1996 876 1522
## Medical/Dental Motorcycle RV
## 304 52 885
## Taxes Vacation Wedding Loans
## 768 771 0
Debt consolidation is by far the most chosen option for loan > 60% of total and including Auto, Home Improvement and Business close to 90%.
To get a better look at the long-tail data, the second chart has been adjusted with an y-axis log10.
EmployementStatus looks to be a variable of less interest, as options seem to be overlapping and ambigous. Takeaway is that an ovewhelming amount of information given is for being employed in some way.
When removing the “Employed” status, we can get a closer look at the distribution of all other employment status’s. This however does not unlock any further information, data still seems to be overlapping and ambigous.
##
## False True
## 0.4714624 0.5285376
IsBorrowerHomeowner Variable was chosen as I believed it could have a major impact on other variables, but seeing that the split is almost 50/50 i am now less sure that this is a major factor.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7500 9083 13500 35000
According to www.prosper.com, loans are offered up to $35,000USD so Max of 35,000 is no surprise. Bulk of loans are between $7,500 and $12,500 with a mean of approx. $9,000.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 609.5 669.5 709.5 708.9 729.5 889.5
Bin size and X-axis adjusted to show most CreditScores between 675 and 725, with a mean of 708.9
IncomeRange increasing to $75,000 and then dropping off. Small increase $100,000+ vs. $75,000-99,999 is counter logical, would have expected a further decrease.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 6.00 5.95 8.00 11.00
ProsperRating and ProsperScore assumed closely correlated, but there is a noticeable variance with Prosperrating almost perfectly normal distributed, prosperscore is bi-modal distributed around the Median. These distributions are not surprising given that data is only for loans given, if data had included also loans not approved/funded, the distributions would probably be more positively skewed.
There are 84,853 loan records in the adjusted dataset, with 12 chosen variables (term, loanStatus, borrowerRate, estimatedLoss, prosperRating, prosperScore, listingCategory, employmentStatus, isBorrowerHomeowner, incomeRange, loanOriginalAmount, creditScoreRangeAvg) Variables prosperRating and incomeRange are ordered Factor variables with following levels:
(Worst) —————–> (Best)
prosperRating: HR,E,D,C,B,A,AA
incomeRange: $0,‘Not employed’,$1-24.999,$25.000-49.999,$50.000-74.999,$75.000-99.000,$100.000+
Other observations: * Most loans are with a 36month term * The Mean loan amount is $9.083 * An overwhelming amount of loans are requested for debt consolidation * Loan takers are split approximately 50/50 between homeowners and non-homeowners * Median borrowerrate is 0.1875 and the max borrowerrate is 0.36
The main feature of the dataset is borrowerRate and prosperRating. I would like to analyze what features are best for predicting the borrowerRate. The prosperRating seems to be the most likely candidate.
prosperScore, loanOriginalAmount and creditScoreRatingAvg could logically also have an impact on borrowerRate, while estimatedLoss seems to be a product of borrowerRate or possibly vice-versa.
creditScoreRangeAvg was created as the average of the 2 variables from the orginal dataset “CreditScoreRangeLower”/“CreditScoreRangeUpper”. All variables were changed from the original syntax of uppercase first character to lowercase and over-explained naming simplified. e.g.“ProsperRating..Alpha” changed to “prosperRating”
borrowerRate at first look had a number of peaks, which via adjusting bin-size showed a distibution around 0.18-0.20 with a further peak at approx. 0.32.
listingCategory which in original form was only a numeric value was mapped against the data library description of numerical values, to increase information value. The data was then plotted in a bar chart sorted in decreasing order for count. Data showed an overwhemling amount of loans take for debt consolidation.
loanStatus had a large number of records which were split into a number of buckets which were all described “past due…..”. Data was cut into one total bucket “past-due” to view if this had any significance. Even when pooled together the loanStatus shows very few records which are past-due.
First of all I wanted analyse further the dual peak of BorrowerRate, as I was wondering if this could be further explained via faceting with IncomeRange? But this is definately not the case as almost all distributions are similar to BorrowerRate distribution. But faceting with ProsperRating is definately pointing towards an explantion as BorrowerRate is moving from right to left with improvement in ProsperRating.
## 'data.frame': 84853 obs. of 7 variables:
## $ term : Ord.factor w/ 3 levels "12"<"36"<"60": 2 2 2 3 2 2 2 2 3 2 ...
## $ borrowerRate : num 0.092 0.0974 0.2085 0.1314 0.2712 ...
## $ estimatedLoss : num 0.0249 0.0249 0.0925 0.0449 0.1275 ...
## $ prosperRating : Ord.factor w/ 7 levels "HR"<"E"<"D"<"C"<..: 6 6 3 5 2 4 7 7 4 5 ...
## $ prosperScore : num 7 9 4 10 2 4 9 11 7 4 ...
## $ loanOriginalAmount : int 10000 10000 15000 15000 3000 10000 10000 10000 13500 4000 ...
## $ creditScoreRangeAvg: num 690 810 690 750 690 ...
In order to do a correlation matrix a subset of data is created.
The variables of borrowerRate and prosperRating is as expected very high, as well as between borrowerRate and estimatedLoss.
Further significant relationships could be found between in decending order borrowerRate/prosperScore - borrowerRate/creditScoreAvg - borrowerRate/loanOriginalAmount
And finally between prosperRating/prosperScore.
From this subset I want to look closer at relationship between borrowerRate and prosperRating, prosperScore, loanOriginalAmount, creditScoreRangeAvg, as well as between prosperRating and prosperScore using scatterplots/boxplots and summary statistics.
## $HR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1779 0.3134 0.3177 0.3173 0.3177 0.3600
##
## $E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1479 0.2712 0.2925 0.2933 0.3149 0.3600
##
## $D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1157 0.2287 0.2492 0.2464 0.2625 0.3500
##
## $C
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0895 0.1765 0.1914 0.1944 0.2099 0.3500
##
## $B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0693 0.1414 0.1509 0.1545 0.1639 0.3500
##
## $A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0498 0.0990 0.1119 0.1129 0.1239 0.2150
##
## $AA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04000 0.06990 0.07790 0.07912 0.08450 0.21000
The boxplot and exact detail in summary, further illustrates the very strong realationship between borrowerRate and prosperRating, note how the 1st to 3rd quartiles of borrowerRate per prosperRating, have almost no overlap.
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1099 0.2999 0.3123 0.3021 0.3177 0.3500
##
## $`2`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1050 0.2492 0.2786 0.2712 0.3032 0.3600
##
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0809 0.2100 0.2488 0.2479 0.2925 0.3500
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0716 0.1790 0.2124 0.2254 0.2699 0.3600
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0716 0.1715 0.2199 0.2291 0.3058 0.3600
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0699 0.1535 0.1940 0.2062 0.2599 0.3500
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0659 0.1385 0.1760 0.1851 0.2468 0.3500
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0586 0.1139 0.1449 0.1517 0.1774 0.3600
##
## $`9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0498 0.0946 0.1139 0.1251 0.1435 0.3500
##
## $`10`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04000 0.07160 0.08790 0.09797 0.11590 0.35000
##
## $`11`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06050 0.06590 0.08690 0.09328 0.10990 0.19500
prosperScore does not show the same strong relationsship with borrowerRate as prosperRating, many overlapping borrowerRate’s per prosperScore and definately less linearity.
## $HR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.595 5.000 10.000
##
## $E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.695 5.000 10.000
##
## $D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 4.924 6.000 10.000
##
## $C
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 5.468 7.000 11.000
##
## $B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 7.000 6.767 8.000 11.000
##
## $A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 7.000 8.000 8.021 9.000 11.000
##
## $AA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 9.000 10.000 9.495 10.000 11.000
This relationship is somewhat surprising, note how median prosperScore is higher for “HR” vs “E”, as well as equal between ‘D’ and ‘C’. From further reading on www.prosper.com it is noted that while prosperRating is calculated using a model ensuring consistency, prosperScore is a custom built tool using not only lenders own data and history, but also history from peer group.
borrowerRate vs loanOriginalAmount is as expected not displaying any significant relationsship, even when adjusting for overplotting etc.
borrowerRate and estimatedLoss after eliminating the outliers above 0.3 estimated loss, shows a very fine liniar relationship
borrowerRate vs. creditScoreAvg however is showing a far less liniar relationsship although trend of lower creditScoreAvg = higher borrowerRate is visible.
## [1] -0.9641819
Finally please note the almost perfect relationsship between prosperRating and estimatedLoss with almost no overlap between prosperRating/estimatedLoss pairs, and only outliers in the “HR” prosperRating.
BorrowerRate correlates strongly with prosperRating and to a lesser extent with prosperScore and creditScoreAvg.
As prosperRating decreases (i.e. from Best to worst) the borrowerRate increases, same relationship with prosperScore and creditScoreAvg. Relationship between borrowerRate and prosperRating seems liniar.
Further interesting relationships are seen between prosperScore and prosperRating as well as between estimated loss and prosperScore.
Also between borrowerRate and prosperScore, showing unusual and counter logical relationsship across prosperScore, indicating that one or more other variables are needed to explain this relationsship. A further investigation on prosper website clarifies this, as stated:
Prosper uses both the custom score (prosperScore) and the credit reporting agency score (creditScoreRangeAvg) together to assess the borrower’s level of risk and determine estimated loss rates (estimatedLoss).
Relationship between borrowerRate and estimatedLoss as well as between prosperRating and estimatedLoss are extremely strong, so strong in fact that i suspect that one is a direct product of the other.
This relationship between borrowerRate/estimatedLoss/prosperRating as well as prosperScore will be investigated further in the next section.
Following the finding in the Bi-variate section I am looking for relationsships, between the main variables borrowerRate/creditScoreRangeAvg and specified via prosperRating or prosperScore. While there seems to be some relationship between these variables, it is only highlevel and not estimated very strong, next I will exchange estimatedLoss for borrowerRate and do same plotting.
Using estimatedLoss definately shows a more solid relationsship, with a narrow band for all prosperScore/prosperRating and estimatedLoss/creditScoreRangeAvg pairings. Only outliers are on the prosperRating “HR” the High-Risk category, which makes sense since this is the last and worst category available and needs to incorporate all High Risk estimatedLoss’s in this bucket.
Building further on this we can look at the relationship on a Mean basis
This further illustrates the strong relationsship between estimatedLoss/creditScoreRangeAvg and prosperRating/prosperScore, but there are some un-expected increases at the highest creditScoreRangeAvg. Perhaps including both prosperRating and prosperScore in analysis, as well as all 5 variables will show a clearer picture.
This definately reuslted in a better fit. For all three options, the high variability is evident in the “HR” bucket and improved estimatedLoss/borrowerRate highly visible when improving score/rating and creditScore. Especially the last chart including all 5 variables seems to be a good illustration of how the 5 variables are connected.
Drawing on the insights from the bi-variate section, it was found, when including all 5 variables, a strong connected relationship, this is somewhat expected in financial loan data, as both borrower and lender will have many other options to lend/invest and consistency/transparency is required.
It was noted that no combination of borrowerRate vs prosperRating/prosperScore/creditScoreRangeAvg, nor estimateLoss vs prosperRating/prosperScore/creditScoreRangeAvg resulted in a solid relationsship which could reasonably be assumed a good fit for a predictive model, it takes a combination of all 5.
Variability of estimatedLoss/borrowerRate increases with weakening of prosperScore/prosperRating.
No models created, but there seems to be a good chance that building a model using variables estimatedLoss/borrowerRate/creditScoreRangeAvg/prosperRating/prosperScore, would result in a tool to determine if loan offered is on good terms and/or a resonable risk/reward investment.
BorrowerRate is between normal and Bi-modal distribution. Note the peak at > 0.3. Further analysis of this showed the reason being a large number of loans with HR (High Risk) rating, which are almost entirely all > 0.3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1359 0.1875 0.1960 0.2574 0.3600
“Prosper Rating”" and “Estimated Loss”" are almost perfectly correlated with no overlap and very few outliers, showing a probable deliberate and systemic relationsship between these 2 variables.
Building further on analysis and insights from univariate and bi-variate analysis, we arrive at one of many decent illustrations, in the multivariate section. This plot hightlighting the consistency across records and a set of chosen variables, which is expected in financial loan records. I.e. a very clear relationship, showing that the higher the “Average Credit Score Range”" coupled with higher “Prosper Rating” equals lower “Estimated Loss”. Also noted the increasing variability in “Estimated Loss”" as “Prosper Rating” decreases from AA to HR.
Relationship is further illustrated in the correlation matrix showing Medium to Strong associations between the 3 variables.
## estimatedLoss prosperRating creditScoreRangeAvg
## estimatedLoss 1.0000000 -0.9641819 -0.5112628
## prosperRating -0.9641819 1.0000000 0.5488738
## creditScoreRangeAvg -0.5112628 0.5488738 1.0000000
the prosperloan dataset includes information for 113,937 records with 81 variables per record. I choose to limit this to 84,853 records removing all data which was from before July 2009, where the business model changed significantly and most complete and consistent dataset is post July 2009. From these 84,853 i choose 13 variables which were reduced to 12, as 2 of the 13 chosen variables were averaged into 1.
From this data subset I first analyzed each variable individually, to find the structure and distribution of variables, and following this looked at the correlation between pairs of data. It was determined that of the chosen variables most interest is on borrowerRate and estimatedLoss. A number of other variables were quickly determined to be of less interest and with very little impact/correlation on the main variables (EmploymentStatus/Term/ListingCategory/isBorrowerHomeowner/LoanStatus).
There were a number of strong relationsships between main variables and rating variables(ProsperScore/ProsperRating/CreditScoreRating) from 0.51 to 0.96, which were further investigated and resulting finally in a 5 variable overview showing a highly connected as well as expected relationsship. I.e from a borrowerRate there seems to a predictable relationship between actual rate/variability vs. the indivual/combined scores/rating, the same relationship from an EstimatedLoss viewpoint.
There were 81 variables in the original dataset and this analysis has been done on only 13, there is therefore a risk that data from the remaining 68 variables, would reveal insights that could discount or even dissprove the findings from the analysis on chosen variables. Analysis could therefore be improved/strengthen via include further variables.
It seems likely given the analysis that a predictive model could be built using the 5 variables BorrowerRate/EstimatedLoss/CreditScoreRangeAvg/ProsperScore/ProsperRating, which could assist potential borrower in determening if borrowerRate offered is fair and consistent with peers, and assist potential loan investors in assesing the risk/reward profil on loan investment.